Select an area and download the data from OpenStreetMap.
Audit the data, then investigate it using SQL and Pandas.
West Hollywood, CA, United States
This is a place where a lot of celebrities live, and I am interested in how people live their lives there. The data source is OpenStreetMap. The following is an overview of the map.
from IPython.display import HTML
HTML('<iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://www.openstreetmap.org/export/embed.html?bbox=-118.3918%2C34.0754%2C-118.3320%2C34.0996&layer=mapnik"></iframe><br/>')
After downloading the data and studying a small sample of this area, I noticed the following problems with the data:
I will deal with these problems in the above order.
All postal codes are summarized and the result is saved as a dictionary, with each value being that postcode's count of occurrences.
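A tally like the one below can be produced by streaming the OSM XML file with `xml.etree.ElementTree` and counting with `collections.Counter`. This is a sketch: the file name `west_hollywood.osm` is an assumption, and the tag key `addr:postcode` is the standard OSM convention for postal codes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_postcodes(filename):
    """Tally every addr:postcode value found in an OSM XML file."""
    counts = Counter()
    # iterparse streams the file, so even large OSM extracts fit in memory
    for _, elem in ET.iterparse(filename, events=('end',)):
        if elem.tag in ('node', 'way'):
            for tag in elem.iter('tag'):
                if tag.get('k') == 'addr:postcode':
                    counts[tag.get('v')] += 1
            elem.clear()  # free the element once processed
    return dict(counts)

# postcodes = count_postcodes('west_hollywood.osm')  # hypothetical file name
```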
{'90004': 1,
'90027': 1,
'90028': 4,
'90036': 10,
'90038': 5,
'90046': 124,
'90046-4101': 1,
'90048': 9,
'90069': 612,
'90210': 47,
'CA 90036': 1,
'CA 90048': 1,
'CA 90069': 2}
I clean the data in two steps.
Step 1. Standardize the format to a 5-digit string.
Step 2. Append the city name after the 5-digit postcode.
In the second step, I match city names to the postal codes by parsing the website http://www.zipcodestogo.com/California/. The result shows that West Hollywood is indeed the area of our focus. Nodes from other areas are included because we had to draw a rectangle when selecting the desired area on OpenStreetMap.
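The two steps can be sketched as follows. The `CITY_BY_ZIP` dictionary stands in for the table parsed from zipcodestogo.com and is hard-coded here from the results below; the regex in step 1 drops both `CA ` prefixes and ZIP+4 suffixes.

```python
import re

# Stand-in for the zip-to-city table parsed from zipcodestogo.com
CITY_BY_ZIP = {'90069': 'West Hollywood', '90210': 'Beverly Hills'}
CITY_BY_ZIP.update({z: 'Los Angeles' for z in
                    ('90004', '90027', '90028', '90036', '90038',
                     '90046', '90048')})

def clean_postcode(raw):
    """Step 1: reduce to a 5-digit string; step 2: append the city name."""
    match = re.search(r'\d{5}', raw)   # drops 'CA ' prefixes, ZIP+4 suffixes
    if match is None:
        return raw                     # leave unparseable values untouched
    zip5 = match.group()
    city = CITY_BY_ZIP.get(zip5, 'unknown')
    return '{}, {}'.format(zip5, city)
```

For example, `clean_postcode('CA 90069')` yields `'90069, West Hollywood'` and `clean_postcode('90046-4101')` yields `'90046, Los Angeles'`.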
{'90004': '90004, Los Angeles',
'90027': '90027, Los Angeles',
'90028': '90028, Los Angeles',
'90036': '90036, Los Angeles',
'90038': '90038, Los Angeles',
'90046': '90046, Los Angeles',
'90046-4101': '90046, Los Angeles',
'90048': '90048, Los Angeles',
'90069': '90069, West Hollywood',
'90210': '90210, Beverly Hills',
'CA 90036': '90036, Los Angeles',
'CA 90048': '90048, Los Angeles',
'CA 90069': '90069, West Hollywood'}
The desired format for phone numbers is +1-213-224-4153; that is,
However, after querying the data, we find the following inconsistent formats:
{'(323) 464-2989': 1,
'(323) 654-4411': 1,
'(323) 654-6686': 1,
'(323) 874-6700': 1,
'(323) 934-1121': 1,
'+1 (310) 980 8064': 1,
'+1 (323) 857-1882': 1,
'+1 (323) 928-3002': 1,
'+1 (323) 944-0855': 1,
'+1 310 3600916': 1,
'+1 310-652-5091': 1,
'+1 310-854-3488': 1,
'+1 323 654 7125': 1,
'+1 323 6548271': 1,
'+1 323 747 1388': 1,
'+1 323-654-4222': 1,
'+1 323-654-8713': 1,
'+1- 323-978-2170': 1,
'+1-1-310-424-1600': 1,
'+1-323-650-057': 1,
'+1-800-PINKDOT': 1,
'+13239931700': 1,
'0013239366154': 1,
'013239514800': 1,
'1-323-871-8318': 1,
'1-323-874-7924': 1,
'310 289 2000': 1,
'310-720-3809': 1,
'323-379-2091': 1,
'323-415-6860': 1,
'323-424-7731': 1,
'323-540-4551': 1,
'323-697-5338': 1,
'323-931-1466': 1,
'323-937-2801': 1,
'323-988-1119': 1,
'855-888-5575': 1}
These bad formats may belong to one or several of the following categories:
I will concentrate on (1)-(6). Notice that (8) is already in the correct format, and (7) cannot be fixed without further information. After cleaning, we have:
{'(323) 464-2989': '+1-323-464-2989',
'(323) 654-4411': '+1-323-654-4411',
'(323) 654-6686': '+1-323-654-6686',
'(323) 874-6700': '+1-323-874-6700',
'(323) 934-1121': '+1-323-934-1121',
'+1 (310) 980 8064': '+1-310-980-8064',
'+1 (323) 857-1882': '+1-323-857-1882',
'+1 (323) 928-3002': '+1-323-928-3002',
'+1 (323) 944-0855': '+1-323-944-0855',
'+1 310 3600916': '+1-310-360-0916',
'+1 310-652-5091': '+1-310-652-5091',
'+1 310-854-3488': '+1-310-854-3488',
'+1 323 654 7125': '+1-323-654-7125',
'+1 323 6548271': '+1-323-654-8271',
'+1 323 747 1388': '+1-323-747-1388',
'+1 323-654-4222': '+1-323-654-4222',
'+1 323-654-8713': '+1-323-654-8713',
'+1- 323-978-2170': '+1-323-978-2170',
'+1-1-310-424-1600': '+1-310-424-1600',
'+1-323-650-057': '+1-323-650-057',
'+1-800-PINKDOT': '+1-800-PINKDOT',
'+13239931700': '+1-323-993-1700',
'0013239366154': '+1-323-936-6154',
'013239514800': '+1-323-951-4800',
'1-323-871-8318': '+1-323-871-8318',
'1-323-874-7924': '+1-323-874-7924',
'310 289 2000': '+1-310-289-2000',
'310-720-3809': '+1-310-720-3809',
'323-379-2091': '+1-323-379-2091',
'323-415-6860': '+1-323-415-6860',
'323-424-7731': '+1-323-424-7731',
'323-540-4551': '+1-323-540-4551',
'323-697-5338': '+1-323-697-5338',
'323-931-1466': '+1-323-931-1466',
'323-937-2801': '+1-323-937-2801',
'323-988-1119': '+1-323-988-1119',
'855-888-5575': '+1-855-888-5575'}
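The normalization above can be sketched as follows. It strips punctuation, removes international dialing prefixes (`00`, `0`, and repeated `1` country codes), and regroups the remaining ten digits. Two heuristics are assumptions of this sketch: vanity numbers containing letters (such as +1-800-PINKDOT) are passed through unchanged, and a leading digit of 0 or 1 is treated as an invalid area code, which keeps truncated numbers like +1-323-650-057 untouched.

```python
import re

def clean_phone(raw):
    """Normalize a US phone number to the +1-XXX-XXX-XXXX format."""
    if re.search(r'[A-Za-z]', raw):
        return raw                        # vanity numbers (e.g. PINKDOT): keep
    digits = re.sub(r'\D', '', raw)       # keep digits only
    if digits.startswith('00'):           # international prefix 00-1-...
        digits = digits[2:]
    elif digits.startswith('0'):          # stray leading zero
        digits = digits[1:]
    while len(digits) > 10 and digits.startswith('1'):
        digits = digits[1:]               # drop (possibly repeated) country code
    if len(digits) != 10 or digits[0] not in '23456789':
        return raw                        # truncated/odd numbers: needs a human
    return '+1-{}-{}-{}'.format(digits[:3], digits[3:6], digits[6:])
```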
We observe that some of the street names are abbreviated or mistyped:
{'Ave': set(['N La Brea Ave']),
'Blvd': set(['7290 Beverly Blvd',
'North Robertson Blvd',
'Santa Monica Blvd',
'Sunset Blvd']),
'Blvd.': set(['North Robertson Blvd.', 'Santa Monica Blvd.']),
'avenue': set(['north la brea avenue']),
'blvd': set(['sunset blvd'])}
The following cleanings are implemented:
sunset blvd => sunset Boulevard
North Mansfield => North Mansfield
Sunset Blvd => Sunset Boulevard
Santa Monica Blvd => Santa Monica Boulevard
7290 Beverly Blvd => 7290 Beverly Boulevard
North Robertson Blvd => North Robertson Boulevard
north la brea avenue => north la brea Avenue => North la brea Avenue
N La Brea Ave => N La Brea Avenue => North La Brea Avenue
North Robertson Blvd. => North Robertson Boulevard
Santa Monica Blvd. => Santa Monica Boulevard
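The street-name fixes above can be implemented with a small word mapping. The `MAPPING` dictionary covers only the abbreviations observed in this data set; the second-pass capitalization shown in the log (e.g. `north la brea Avenue => North la brea Avenue`) is a separate step not sketched here.

```python
# Expansions for the abbreviations observed in the audit
MAPPING = {'Ave': 'Avenue', 'avenue': 'Avenue',
           'Blvd': 'Boulevard', 'Blvd.': 'Boulevard', 'blvd': 'Boulevard',
           'N': 'North'}

def clean_street(name):
    """Expand abbreviated street-type words, keeping the rest of the name."""
    words = name.split()
    return ' '.join(MAPPING.get(w, w) for w in words)
```

For example, `clean_street('N La Brea Ave')` yields `'North La Brea Avenue'`.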
Size of the file ....... 58,913 KB
nodes.csv .............. 23,462 KB
nodes_tags.csv ......... 293 KB
ways.csv ............... 1,595 KB
ways_tags.csv .......... 6,761 KB
ways_nodes.csv ......... 4,979 KB
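A listing like the one above can be generated with a short helper, assuming the CSV files sit in the working directory:

```python
import os

def size_report(filenames):
    """Return one 'name ... size KB' line per file, like the table above."""
    return ['{:<24s} {:>10,.0f} KB'.format(name,
                                           os.path.getsize(name) / 1024.0)
            for name in filenames]

# for line in size_report(['nodes.csv', 'nodes_tags.csv', 'ways.csv',
#                          'ways_tags.csv', 'ways_nodes.csv']):
#     print(line)
```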
My analysis consists of two parts.
Part 1: map data analysis
Part 2: user data analysis
I use SQL queries to present basic statistics, and discuss some ideas based on the observations.
To facilitate analysis, I will first combine the following two tables into one using SQL (a UNION ALL): nodes_tags and ways_tags.
cur.executescript('''
CREATE TABLE tags_nodes_ways (
id INTEGER,
key TEXT,
value TEXT,
type TEXT,
FOREIGN KEY (id) REFERENCES nodes(id)
);
INSERT INTO tags_nodes_ways
SELECT * FROM nodes_tags
UNION ALL
SELECT * FROM ways_tags;
''')
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A
where A.key == 'amenity'
group by A.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
join (select distinct(id)
from tags_nodes_ways
where value == 'place_of_worship') A
on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'religion'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
Sort restaurants by count.
Observations:
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
join (select distinct(id)
from tags_nodes_ways
where value == 'restaurant' or value == 'fast_food') A
on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'cuisine'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select tags_nodes_ways.id, tags_nodes_ways.key, tags_nodes_ways.value,
count(*) as num
from tags_nodes_ways
join (select distinct(id)
from tags_nodes_ways
where value == 'bank') A
on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'name'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
where tags_nodes_ways.key == 'postcode'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A
where A.key == 'leisure'
group by A.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A
where A.key == 'shop'
group by A.value
having num >= 3
order by num desc
;
''')
pprint.pprint(cur.fetchall())
It seems that a lot of people come here for vacation:
cur.execute('''
select A.value, count(*) as num
from tags_nodes_ways as A
where A.key == 'tourism'
group by A.value
having num >= 3
order by num desc
;
''')
pprint.pprint(cur.fetchall())
From the results, I notice that the same data entry often appears under different names. Here are several examples.
Example 1: source of data.
Bing appears in the following forms: u'TIGER, Bing', u'Bing', u'Bing; TIGER 2012', u'bing', and u'Tiger2011; Bing'.
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
where tags_nodes_ways.key == 'source'
group by tags_nodes_ways.value
order by num desc
;
''')
pprint.pprint(cur.fetchall())
To obtain the correct count of entries sourced from Bing, the following code works:
cur.execute('''
select tags_nodes_ways.value, count(*) as num
from tags_nodes_ways
where tags_nodes_ways.key == 'source' and tags_nodes_ways.value like '%bing%'
order by num desc
;
''')
pprint.pprint(cur.fetchall())
Example 2: name of coffee shops.
cur.execute('''
select *
from (select tags_nodes_ways.id, tags_nodes_ways.key, tags_nodes_ways.value,
count(*) as num
from tags_nodes_ways
join (select distinct(id)
from tags_nodes_ways
where value == 'cafe') A
on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'name'
group by tags_nodes_ways.value
order by num desc) B
where B.value like '%Starbucks%'
;
''')
pprint.pprint(cur.fetchall())
Example 3: name of shops.
7-Eleven appears as u'7-Eleven', u'7 Eleven', and u'7-eleven'.
cur.execute('''
select *
from (
select tags_nodes_ways.id, tags_nodes_ways.key, tags_nodes_ways.value,
count(*) as num
from tags_nodes_ways
join (select distinct(id)
from tags_nodes_ways
where key == 'shop') A
on A.id == tags_nodes_ways.id
where tags_nodes_ways.key == 'name'
group by tags_nodes_ways.value
order by num desc
) B
WHERE instr(B.value, '7') > 0
;
''')
pprint.pprint(cur.fetchall())
Inconsistent names lead to inaccurate counts. It would be helpful if a standard naming convention for the most commonly used names were provided to users for reference.
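As a stopgap on the analysis side, known variants can be folded into one canonical spelling before counting. The `CANONICAL` dictionary below covers only the 7-Eleven forms listed in Example 3; it would need to be extended as new variants turn up.

```python
from collections import Counter

# Canonical spelling for each variant observed above (Example 3)
CANONICAL = {'7 Eleven': '7-Eleven', '7-eleven': '7-Eleven'}

def canonical_counts(names):
    """Tally names after folding known variants into one spelling."""
    return Counter(CANONICAL.get(n, n) for n in names)
```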
cur.execute('''
select count(*)
from nodes;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select count(*)
from ways;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select count(distinct(A.user))
from (select user from nodes
union all
select user from ways) A
;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select A.user, count(*) as num
from (select user from nodes
union all
select user from ways) A
group by A.user
order by num desc
limit 10
;
''')
pprint.pprint(cur.fetchall())
cur.execute('''
select count(*)
from (select A.user, count(*) as num
from (select user from nodes
union all
select user from ways) A
group by A.user
having num <= 3) B
;
''')
pprint.pprint(cur.fetchall())
We observe that contributions are highly skewed across users: 88.59% of all entries were contributed by the following ten users.
| username | contributions count | percentage |
|---|---|---|
| schleuss_imports | 130738 | 46.54% |
| dannykath_labuildings | 35912 | 12.79% |
| manings_labuildings | 23468 | 8.35% |
| RichRico_labuildings | 12544 | 4.47% |
| piligab_labuildings | 11952 | 4.26% |
| Luis36995_labuildings | 10488 | 3.73% |
| karitotp | 9836 | 3.50% |
| yurasi_import | 5556 | 1.98% |
| bdiscoe_imports | 4349 | 1.55% |
| markbegbie | 4007 | 1.43% |
| TOTAL | 248850 | 88.59% |
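A table like the one above can be reproduced with pandas, assuming the nodes and ways tables have been loaded into DataFrames that each have a `user` column:

```python
import pandas as pd

def top_contributors(df_nodes, df_ways, n=10):
    """Top-n users by contribution count, with their share of all entries."""
    users = pd.concat([df_nodes['user'], df_ways['user']])
    counts = users.value_counts()                       # sorted descending
    share = (100.0 * counts / counts.sum()).round(2)    # percentage of total
    return pd.DataFrame({'contributions': counts.head(n),
                         'percentage': share.head(n)})
```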
Most of the user names end with _imports or _labuildings (see the Los Angeles County building import). This suggests that the data were imported rather than entered manually. We now verify this point.
If the data were imported, we would expect a large inflow of data within a short period of time. We now visualize the data by date.
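The `df_nodes_ways_ct` table used below can be built from the `timestamp` columns of the nodes and ways tables. This sketch assumes OSM's ISO 8601 timestamp format (e.g. 2016-07-05T12:34:56Z) and the column names used elsewhere in this report:

```python
import pandas as pd

def build_contribution_table(df_nodes, df_ways):
    """One row per contribution, with datetime parts split out for plotting."""
    df = pd.concat([df_nodes[['user', 'timestamp']],
                    df_ways[['user', 'timestamp']]], ignore_index=True)
    df['ymd'] = pd.to_datetime(df['timestamp'])
    df['year'] = df['ymd'].dt.year
    df['month'] = df['ymd'].dt.month
    df['day'] = df['ymd'].dt.day
    df['weekday'] = df['ymd'].dt.weekday   # 0 = Monday ... 6 = Sunday
    df['hour'] = df['ymd'].dt.hour
    return df
```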
import cufflinks as cf    # pandas bindings for interactive plotly charts

cf.go_offline()           # render interactive graphs offline
df_nodes_ways_ct.ymd.dt.date.iplot(kind='histogram',
                                   filename='cufflinks/basic-histogram')
We notice unusually high data contributions in June and July 2016. I now look at the data contributions by day for each of these months.
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize = (10,5))
plt.suptitle('Count of contributions by day', fontsize=16)
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2, sharey = ax1)
ax21 = ax2.twinx()
sns.set()
sns.set_style('dark')
sns.distplot(df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
(df_nodes_ways_ct['month']==6)]['day'],
ax = ax1,
hist_kws={"histtype": "stepfilled", "linewidth": 0.5,
"alpha": 0.4, "color": "g"})
ax1.axhline( y = 0.36, color = 'g', linestyle = '--')
ax1.set(ylabel='relative frequency', title = 'month = 6')
sns.distplot(df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
(df_nodes_ways_ct['month']==7)]['day'],
ax = ax2,
hist_kws={"histtype": "stepfilled", "linewidth": 0.5,
"alpha": 0.4, "color": "g"})
sns.distplot(df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
(df_nodes_ways_ct['month'].isin([7]))]['day'],
ax = ax21, kde = False)
ax21.axhline( y = 15000 , color = 'g', linestyle = '--')
ax21.set(ylabel='absolute frequency', title = 'month = 7')
There are three days on which user contributions exceed 15000 (above the green dotted line).
I wonder how many contributors there were on each of these three days.
fig = plt.figure(figsize = (10,5))
plt.suptitle('Count of contributions by users, > 15000 contributions', fontsize=16)
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2,sharey = ax1)
sns.set()
sns.set_context("notebook")
sns.countplot(x = df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
(df_nodes_ways_ct['month']==6) & \
(df_nodes_ways_ct['day'].isin([7,18]))]['user'],
ax = ax1,\
facecolor = (0,0,0,0),\
linewidth=5,\
edgecolor=sns.color_palette("BrBG", 7))
ax1.set(ylabel='count of contributions', title = '7th, 18th of June')
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=50)
sns.countplot(x = df_nodes_ways_ct[(df_nodes_ways_ct['year'] == 2016) & \
(df_nodes_ways_ct['month']==7) & \
(df_nodes_ways_ct['day']==5)]['user'],
ax = ax2,\
facecolor = (0,0,0,0),\
linewidth=5,\
edgecolor=sns.color_palette("dark", 3))
ax2.set(ylabel=' ', title = '5th, July')
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=50)
From the left-hand graph, we notice that schleuss_imports is the only contributor on the 7th and 18th of June. It seems that the user Schleuss imports data under the name Schleuss_imports: around 63090 entries were imported under the name Schleuss_imports, whereas only 170 entries were entered under the name Schleuss. I think this is a clear sign of automated versus manual work.
Finally, I am interested in which day of the week and which hours of the day users are most active.
fig = plt.figure(figsize = (10,5))
# plt.suptitle('Count of contributions by users', fontsize=16)
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
sns.set()
sns.set_context("notebook")
sns.countplot(y = df_nodes_ways_ct['weekday'], ax = ax1)
ax1.set(ylabel='weekday', title = 'User activity by weekday')
ax1.set_xticklabels(ax1.get_xticks().astype(int), rotation=40)
sns.countplot(y = df_nodes_ways_ct['hour'],
ax = ax2)
ax2.set(title = 'User activity by hour')
ax2.set_xticklabels(ax2.get_xticks().astype(int), rotation=40)
0: Monday. 1: Tuesday. 2: Wednesday. 3: Thursday. 4: Friday. 5: Saturday. 6: Sunday.
Most of the data were entered on Tuesday, followed by Saturday and Thursday. However, we have to take this result with caution, given that most of the data were imported.
OpenStreetMap (OSM) is a collaborative project maintained by volunteers. Hence, it makes sense that most of the contributions are not made during working hours, i.e., from 8:00 to 18:00.
During data auditing and the descriptive map data analysis, we noticed that data consistency could be improved if there were a standard for users to follow.
I think we could keep users involved by increasing collaboration among them. I suggest that OpenStreetMap ask users to complete their profiles with some personal information. Based on this information and users' preferences, OpenStreetMap could match people with similar traits and recommend that they work in groups.